MP Benchmarks, Roy Longbottom's PC benchmark Collection

MultiThreading Benchmarks

Roy Longbottom

General

These benchmarks execute the same code as the original versions, which were designed to exercise a single CPU, but add multithreading to use up to all available cores. Some rely on a single hand-coded threading method, where more suitable options might exist, while others use OpenMP or QPAR to generate parallelism automatically. In most cases, 32 bit and 64 bit compilations are provided for Windows and Linux.

This report includes detailed results for a quad core, eight thread 3.9 GHz Intel Core i7 CPU, and provides links to others covering various different CPUs.

Whetstone Benchmark - This is mainly dependent on floating point speed, but with some independently timed integer test functions. Each thread executes shared code using independent variables, held mainly in L1 cache, so performance is proportional to the number of cores, or higher with hyperthreading.

Assembly Code Arithmetic - This benchmark executes integer and SSE floating point add instructions via independent threads. On the Core i7, using all four cores, it demonstrates up to 61.5 GFLOPS (maximum specification 62.4) and 12.3 integer MIPS per MHz in total, or about three per MHz per core.

BusSpeed MP Benchmark - This provides read only access to data in caches and RAM. It is intended to demonstrate bus operation, where data is transferred in bursts, and maximum data transfer speed. In the original Windows version, each thread read all the data, starting at the same point. This had to be modified for Linux, due to the excessive impact of caching. Cache based tests demonstrate up to 62 GB/second per core. RAM based tests show 16 GB/second using 1 thread and up to 40 GB/second via 8 threads, 78% of maximum specification.

RandMem MP Benchmark - The program uses the same code for serial and random access via a complex indexing structure and comprises Read and Read/Write tests, covering data from caches and RAM. This benchmark uses data from the same array for all threads, but starting at different points. Serial reading is slower than BusSpeed MP but with similar multithreading gains. Random reading speed can be reduced due to burst reading over buses from caches and particularly RAM, but benefits from multithreading. Read/Write tests produce the worst performance characteristics, where single thread operation can be faster than using multiple threads, particularly with random access.

MP MFLOPS Benchmark - The benchmark carries out calculations of the form x[i] = (x[i] + a) * b - (x[i] + c) * d + (x[i] + e) * f with 2, 8 or 32 operations per input data word, via caches and RAM. Each thread deals with separate segments of the data, via shared code, fully demonstrating multithreading speed gains. Performance is highly dependent on the capability of the compiler available at production time, in this case on whether it generates old i87, SSE or AVX1 instructions, and whether full SIMD is implemented for SSE and AVX. For the i7 based PC, quad core single precision GFLOPS are shown to be up to 24 with i87 or SISD, 94 with SIMD and 177 with AVX1. The calculations can make use of linked multiply and add instructions, giving a maximum specification of 249.6 GFLOPS for AVX1.

OpenMP and QPAR MFLOPS Benchmarks - The benchmarks carry out the same calculations as the MP MFLOPS Benchmark, using essentially the same code but with no explicit threading code. Instead, critical loops are preceded by a simple “go parallel” directive. QPAR is a Microsoft alternative to OpenMP. With the results shown, Windows maximum speeds are 24 GFLOPS using OpenMP and 91 GFLOPS using QPAR. Linux results show 23 GFLOPS for 32 bit compilations and 50 GFLOPS at 64 bits, improving to 94 GFLOPS with an AVX compile option.

Whetstone MP Benchmark

The Whetstone programs, initially used in 1972, were the first general purpose benchmarks that set industry standards of computer system performance. Further details and performance of early systems can be found in Whetstone Benchmark History and Results.

The overall performance rating was later upgraded to Millions of Whetstone Instructions Per Second (MWIPS, from KWIPS), with the speed of the eight different test functions also provided, in terms of Millions of Operations Per Second (MOPS), or MFLOPS for floating point calculations. Three PC multithreading versions are available, with results for all being included in Whetstone Benchmark Detailed Later Results. All come in 32 bit and 64 bit versions. The benchmarks effectively run independent threads, possibly demonstrating the best multithreading performance. Full samples of logged performance details are provided below. They are all for the same 3.9 GHz Core i7 CPU and demonstrate variability produced by different compilations, with source code in newsource.zip and further details in dualcore.htm. As indicated, these are dual core benchmarks. They use independent code and data.

Later Windows Versions - whets8thread32.exe and whets8Thread64.exe, with source code, can be found in quadcore.zip. Further details are included in quad core 8 thread.htm. In this case, code is shared between threads, but each has its own data. The benchmarks run 1, 2, 4, 6 and 8 threads. The results below are for the quad core i7 processor with hyperthreading, which provides significant additional performance gains.

Linux Versions - whetsMP32, whetsMP32DP, whetsMP64 and whetsMP64DP can be downloaded in linux_multithreading_apps.tar.gz, along with source code. Further details can be found in linux multithreading benchmarks.htm. This multithreading benchmark also has a run time parameter to specify the number of threads (up to 64), with the default being the number of configured CPUs identified in the gathered system information.

Windows 2 Threads - 64 Bit Version

Whetstone Single Precision MP SSE Benchmark Fri Jul 30 15:51:13 2010
Via Microsoft C/C++ Optimizing Compiler Version 14.00.40310.41 for AMD64

          MFLOPS    Vax        MFLOPS MFLOPS MFLOPS    Cos    Exp  Fixpt     If  Equal
           Gmean   MIPS  MWIPS      1      2      3   MOPS   MOPS   MOPS   MOPS   MOPS

            1829  32308   8142   2157   1967   1442    263    105   5331   2904  17427
Thread 1                         1078    983    765    132   52.1   2653   1450  16339
Thread 2                         1079    984    677    131   52.6   2678   1454   1088

Windows 1 to 8 Threads - 32 Bit Version

Whetstone Single Precision 8 Thread Benchmark Mon May 12 10:17:45 2014
Via Microsoft 32-bit C/C++ Optimizing Compiler Version 13.10.3077 for 80x86

          MFLOPS    Vax        MFLOPS MFLOPS MFLOPS    Cos    Exp  Fixpt     If  Equal
           Gmean   MIPS  MWIPS      1      2      3   MOPS   MOPS   MOPS   MOPS   MOPS

            1055  15341   3672   1243   1059    893   82.7   50.4   3502   5566   1482
Thread 1                         1243   1059    893   82.7   50.4   3502   5566   1482

            2112  30669   7343   2486   2120   1788    165    101   7004  11121   2963
Thread 1                         1243   1060    894   82.7   50.4   3502   5566   1481
Thread 2                         1243   1060    893   82.6   50.4   3502   5555   1481

            4217  60844  14635   4970   4239   3560    330    201  13986  22017   5852
Thread 1                         1243   1060    891   82.6   50.3   3503   5566   1470
Thread 2                         1241   1059    888   82.2   50.2   3489   5564   1464
Thread 3                         1243   1060    890   82.6   50.2   3492   5449   1458
Thread 4                         1243   1060    891   82.5   50.2   3502   5439   1459

            6316  72487  20696   7434   6357   5333    459    288  19319  23188   6802
Thread 1                         1239   1060    888   76.5   47.4   3237   3412   1159
Thread 2                         1238   1058    888   77.1   47.4   3188   3894   1122
Thread 3                         1240   1060    888   77.7   47.9   3206   3234   1157
Thread 4                         1239   1059    890   75.4   48.6   3248   4508   1214
Thread 5                         1240   1060    891   76.4   48.8   3227   4486   1061
Thread 6                         1238   1060    888   76.4   48.0   3213   3654   1090

            8406  80481  26596   9893   8473   7085    590    375  24845  22260   7541
Thread 1                         1237   1059    886   73.8   46.9   3108   2782    943
Thread 2                         1236   1058    886   73.7   46.9   3099   2782    943
Thread 3                         1237   1059    883   73.7   46.9   3104   2782    942
Thread 4                         1238   1060    886   73.8   46.9   3106   2783    943
Thread 5                         1238   1060    885   73.8   46.9   3110   2783    943
Thread 6                         1237   1059    886   73.7   46.9   3103   2782    942
Thread 7                         1233   1059    886   73.7   46.9   3108   2782    942
Thread 8                         1236   1060    887   73.7   46.9   3108   2783    943

Linux Up To 64 Threads - 2 and 8 shown

Multithreading Single Precision Whetstones 32-Bit Version 1.0
Using 2 threads - Sat Nov 8 14:49:17 2014

         MWIPS MFLOPS MFLOPS MFLOPS    Cos    Exp  Fixpt     If  Equal
Thread              1      2      3   MOPS   MOPS   MOPS   MOPS   MOPS

1         3653   1330   1329    938     94     42   4600   5853    932
2         3660   1330   1329    938     95     42   4600   5850    936
Total     7312   2660   2658   1877    189     85   9200  11703   1868
MWIPS     7305 Based on time for last thread to finish

Multithreading Single Precision Whetstones 32-Bit Version 1.0
Using 8 threads - Sat Nov 8 14:50:13 2014

         MWIPS MFLOPS MFLOPS MFLOPS    Cos    Exp  Fixpt     If  Equal
Thread              1      2      3   MOPS   MOPS   MOPS   MOPS   MOPS

1         3084   1324   1323    938     72     39   3200   2931    591
2         3088   1324   1326    938     72     39   3337   2855    592
3         3072   1322   1322    928     72     39   3175   3043    591
4         3073   1310   1280    933     72     39   3233   2884    591
5         3076   1317   1302    929     72     39   3389   2966    591
6         3074   1302   1303    933     72     39   3099   2911    592
7         3076   1319   1302    933     72     39   3202   2913    590
8         3068   1294   1258    936     72     39   3261   2920    590
Total    24612  10513  10417   7468    577    312  25897  23424   4728
MWIPS    24463 Based on time for last thread to finish
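The Linux logs report both a Total (the sum of the per-thread ratings) and an MWIPS figure "based on time for last thread to finish". A minimal sketch of how two such figures can differ is given here. It assumes each thread does the same amount of work and that the second figure divides all the work by the slowest thread's elapsed time. The function names and the numbers are hypothetical and are not taken from the benchmark source.

```python
def summed_rate(work_per_thread, seconds_per_thread):
    """Sum of each thread's own rate (its work divided by its own elapsed time)."""
    return sum(w / t for w, t in zip(work_per_thread, seconds_per_thread))


def last_to_finish_rate(work_per_thread, seconds_per_thread):
    """All work divided by the elapsed time of the slowest thread."""
    return sum(work_per_thread) / max(seconds_per_thread)


# Hypothetical example: two threads doing equal work, one slightly slower.
work = [1000.0, 1000.0]   # arbitrary work units per thread
seconds = [1.00, 1.25]    # illustrative elapsed times

total = summed_rate(work, seconds)
aggregate = last_to_finish_rate(work, seconds)
print(total, aggregate)
```

Under these assumptions the last-to-finish figure can never exceed the summed figure, which matches the small shortfalls in the logs (7305 against 7312, and 24463 against 24612).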

CPU Speed Via Assembly Language Add Instructions

The benchmarks use an integer test and a floating point test. They are first executed separately, then together in two or more threads, with speeds measured in Integer MIPS or MFLOPS. Below are example log files for a quad core 8 thread Core i7 CPU. Results are available in PC CPUID 1994 to 2013, plus Measured Maximum Speeds Via Assembler Code.pdf. The separate tests indicate three integer MIPS per MHz and (nearly) the expected maximum of four SSE floating point adds per clock cycle. They also show significantly higher throughput via eight threads, compared to four.

First Windows Versions (obsolete) - cpuidmp.exe and cpuidMP64.exe are included in dualcore.zip with source code in newsource.zip. This covered 1, 2 and 4 threads.

Later Windows Versions - cpuid8thread32.exe and cpuid8Thread64.exe, with source code, can be found in quadcore.zip. Further details and results are included in quad core 8 thread.htm. Test functions measure performance using 1, 2, 4, 6 and 8 threads.

Linux Versions - cpumaxmp32 and cpumaxmp64, with source code, are in linux_multithreading_apps.tar.gz. The benchmarks can have an input parameter for 1, 2, 4, 8, 16, 32 or 64 threads (example command ./cpumaxmp32 Threads 8), the default being the identified CPU count, such as 8 for a quad core CPU with hyperthreading. Further details and results can be found in linux multithreading benchmarks.htm. This variety has separate tests for integer and floating point calculations at the designated thread count.

Windows 1 to 8 Threads - 32 Bit Version

CPU ID MP 8 Thread Test 32 bit Version 1.0 Sat May 10 12:11:41 2014

Speed adding to registers     Pass 1   Pass 2   Pass 3

Separate Tests
  32 bit SSE MFLOPS            15458    15461    15461
  32 bit Integer MIPS          12291    12291    12291

Two Threads Equal Priority
  32 bit SSE MFLOPS            15460    15460    15461
  32 bit Integer MIPS          12290    12292    12292

Four Threads, First Normal Priority, Others Normal - 1
  32 bit SSE MFLOPS            15425    15455    15457
  32 bit SSE MFLOPS            15449    15455    15449
  32 bit Integer MIPS          12273    12190    12283
  32 bit Integer MIPS          11866    12194    12290
  Total SSE MFLOPS             30874    30910    30906
  Total Integer MIPS           24139    24384    24573

Eight Threads, All Normal Priority
  32 bit SSE MFLOPS            13237     9434    11840
  32 bit SSE MFLOPS            13747     9695    13896
  32 bit SSE MFLOPS             8731    11788    11824
  32 bit SSE MFLOPS             9154    15443    13920
  32 bit Integer MIPS           6171     7072     6624
  32 bit Integer MIPS           6353     6054     6802
  32 bit Integer MIPS           6743     6983     6604
  32 bit Integer MIPS           6809     6239     6833
  Total SSE MFLOPS             44869    46360    51480
  Total Integer MIPS           26076    26348    26863

Linux Multithreading Add Test 64 bit Version 1.0 Fri Oct 20 16:53:05 2017

Integer Additions 8 Threads               SSE Floating Point Additions 8 Threads

Thread 4  -  6350 64 bit Integer MIPS     Thread 3  -  7773 32 Bit SSE MFLOPS
Thread 2  -  6196 64 bit Integer MIPS     Thread 8  -  7763 32 Bit SSE MFLOPS
Thread 7  -  6181 64 bit Integer MIPS     Thread 5  -  7755 32 Bit SSE MFLOPS
Thread 3  -  6169 64 bit Integer MIPS     Thread 2  -  7752 32 Bit SSE MFLOPS
Thread 8  -  6145 64 bit Integer MIPS     Thread 4  -  7742 32 Bit SSE MFLOPS
Thread 5  -  6077 64 bit Integer MIPS     Thread 7  -  7737 32 Bit SSE MFLOPS
Thread 6  -  6047 64 bit Integer MIPS     Thread 1  -  7726 32 Bit SSE MFLOPS
Thread 1  -  5990 64 bit Integer MIPS     Thread 6  -  7681 32 Bit SSE MFLOPS
Total     - 49155 64 Bit Integer MIPS     Total     - 61929 32 Bit SSE MFLOPS
Aggregate - 47924 64 Bit Integer MIPS     Aggregate - 61449 32 Bit SSE MFLOPS

Aggregate based on last to finish

Tot 4 Threads 33748                       61549
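The per-clock rates quoted for this benchmark follow from dividing logged speeds by the 3900 MHz clock. The short sketch below repeats that arithmetic with figures copied from the logs above. The function and variable names are illustrative only.

```python
CPU_MHZ = 3900  # clock speed of the Core i7 used in this report


def per_clock(millions_per_second, mhz=CPU_MHZ):
    """Operations completed per clock cycle, from a rate in millions per second."""
    return millions_per_second / mhz


# Single thread figures from the Windows "Separate Tests" log.
sse_adds_per_clock = per_clock(15461)   # SSE MFLOPS
int_adds_per_clock = per_clock(12291)   # Integer MIPS

# Four cores, each with a maximum of four SSE adds per clock cycle.
peak_sse_mflops = CPU_MHZ * 4 * 4

# Linux 8 thread aggregate SSE MFLOPS as a fraction of that maximum.
linux_fraction = 61449 / peak_sse_mflops

print(round(sse_adds_per_clock, 2), round(int_adds_per_clock, 2))
print(peak_sse_mflops, round(linux_fraction * 100, 1))
```

This gives just under four SSE adds and just over three integer adds per clock cycle for one thread, and a four core maximum of 62400 MFLOPS, that is 62.4 GFLOPS.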

BusSpeed MP Benchmark

This version applies integer AND instructions to a single register, streaming data from caches or RAM. The first test reads one word with a 32 word address increment for the next word, that is 128 bytes with 32 bit words and 256 bytes with 64 bit words. The address increment reduces for the following tests, down to one word (ReadAll), all via C. The last test reads all the data as 16 byte SSE2 words, using assembly code. The benchmark is intended to demonstrate bus operation, where data is transferred in bursts, and maximum data transfer speed. On the latest systems, multiple programs or threads are clearly needed for maximum throughput.
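The access pattern described above can be sketched as follows. This is an illustration in Python rather than the benchmark's C or assembly code. The function names, the array contents and the accumulator width are assumptions made for the example.

```python
def stride_bytes(increment_words, word_bits):
    """Distance in bytes between successive reads for a given word increment."""
    return increment_words * word_bits // 8


def and_strided(data, increment_words):
    """AND every increment_words-th word into a single accumulator."""
    acc = 0xFFFFFFFF
    for i in range(0, len(data), increment_words):
        acc &= data[i]
    return acc


# Increments matching the column names Inc32wds down to ReadAll.
INCREMENTS = [32, 16, 8, 4, 2, 1]

data = [0xFFFFFFFF] * 1024
data[3] = 0x0F0F0F0F  # only the ReadAll pass (increment 1) touches index 3

results = {inc: and_strided(data, inc) for inc in INCREMENTS}
print(stride_bytes(32, 32), stride_bytes(32, 64))
print({inc: hex(value) for inc, value in results.items()})
```

With larger increments only a fraction of the words are used, but whole bursts are still transferred from memory, which is consistent with the lower MBytes/second reported in the Inc32wds columns for RAM sized data.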

First Windows Versions (obsolete) - busmp.exe, busMP64.exe and busMP64Int32.exe are included in dualcore.zip, with source code in newsource.zip and results in busspd2k results.htm.

Later Windows Versions (1 to 8 threads) - bus8thread32.exe and bus8thread64.exe are also in quadcore.zip, with further details included in quad core 8 thread.htm. The following 64 bit example results include some for 32 bit tests, where SSE2 functions are from the same code, but 32 bit words are used for the integer tests, instead of 64 bits. For ReadAll cache based tests, CPU speed (MIPS) tends to be the same, with double data transfer speeds at 64 bits. Then, using RAM, bus and memory speeds become the limiting factors.

Linux Versions - MPbusspeed32, MPbusspeed64, MPbusspeed32V2 and MPbusspeed64V2 can be found in linux_multithreading_apps.tar.gz. They have the same run time format as the above Linux benchmarks for up to 64 threads. Further details can be found in linux multithreading benchmarks.htm. See the comments under Linux Results below.

Windows 1 to 8 Threads - 64 Bit Version

MP Bus Speed Test 64 bit Version 2.0 Sat May 10 11:57:03 2014

Part 1 - 1 Thread MBytes/Second
                                                                             32 bit
  Kbytes Inc32wds Inc16wds  Inc8wds  Inc4wds  Inc2wds  ReadAll 128bSSE2  ReadAll 128bSSE2

       6    31565    31291    31178    42042    42508    41978    61606    21375    61610
      24    31300    31285    31258    42203    42786    41751    62331    21157    62329
      96     5375     5559     5793    11083    20009    34332    40516    20363    40673
     384     5562     5658     5864    11338    19966    33244    39317    20679    38413
     768     5331     5391     5505    10966    19403    32805    37871    20680    38718
    1536     5364     5427     5508    10779    19355    33166    37951    20679    38331
   16380     1070     1356     1955     4248     8046    16688    16838    14103    16757
  131070     1034     1272     1866     4023     7724    16029    15980    13852    15963

Part 2 - 2 Thread MBytes/Second

       6    63147    62371    62552    83983    85074    83689   123233    42597   123206
      24    62579    62580    62188    84353    85351    83515   124250    42252   124188
      96    10779    10875    11473    21904    39332    67550    80624    40717    80597
     384    10088    11391    11560    22649    39705    67022    78033    41352    76206
     768    10574    10610    11042    21889    38669    65967    76066    41356    77275
    1536    10442    10637    10901    21597    38467    66046    75829    41353    76302
   16380     1798     2305     3397     7161    13913    28647    28743    25980    28471
  131070     1780     2310     3424     7193    13808    28589    28617    26066    28578

Part 3 - 4 Thread MBytes/Second

       6   116410   124710    92330   167023   148833   165596   245603    70155   238644
      24   124722   124658    96440   143956   153894   165793   248402    67455   225894
      96    21213    21636    20486    39631    73042   115995   159914    74935   123866
     384    21720    22354    22996    44788    79335   111720   155599    76795   128063
     768    18098    19577    21168    41296    71833   128568   126837    75598   138878
    1536    13887    19117    20564    37334    73001   126388   143677    74219   129958
   16380     2113     2780     4682     9428    18500    36759    37534    36126    37098
  131070     2109     2598     4681     8806    18112    37049    37477    36384    35472

Part 4 - 6 Thread MBytes/Second

       6   118438   106222   105201   161860   157529   178558   295443    88245   309920
      24    89228    71127    80985   110402   127049   167495   216617    87712   228035
      96    17634    19432    18990    38043    68990   111843   134485    83460   143356
     384    18645    18932    19929    42970    76220   123858   138682    83237   142146
     768    18072    17529    19655    40544    65312   124557   132566    79248   141036
    1536    14363    16097    18084    35815    59434   104533   128989    73640   123287
   16380     2043     2763     4568     9273    18501    36749    36798    36852    36663
  131070     2082     2689     4508     9093    18033    35246    36318    36347    36784

Part 5 - 8 Thread MBytes/Second

       6   124479   125263   124774   196833   206725   212245   392939   107411   402166
      24    53893    57161    59948    89256   129520   173683   263250   100645   259380
      96    21217    21589    22492    44013    84359   147831   165343    98906   164050
     384    21016    21622    22726    43780    80221   147095   165442    98539   161937
     768    19382    20258    21737    42635    80814   144343   159745    98558   160982
    1536     9986    10664    12858    24661    49622    83158    93985    60140    92112
   16380     2074     2748     4525     9123    18245    36548    36486    36504    36414
  131070     2072     2759     4525     9123    18216    36571    36445    36481    36443

Linux Results

Windows versions, and initial Linux programs, arranged for all threads to start by reading data from the beginning. This did not appear to raise any issues via Windows but it clearly did so using Linux. This became particularly noticeable on later CPUs, such as the Core i7 reported on here, with a 10 MB shared L3 cache. Maximum memory data transfer speed of this PC is 51.2 GB/second.

The first results below are for Version 1, single thread, 64 bit and 32 bit, with performance similar to the Windows versions, that is, faster integer MB/second via caches at 64 bits. The other results are for 64 bit Version 2, where performance is quite similar to the Windows (Version 1) speeds. (Ignore the 6 KB speeds, which need a longer test.) The last two columns are Linux Version 1 results, where RAM speeds are shown to be faster than the 51.2 GB/second specification, due to caching effects. In Version 2, each thread reads all the data but at staggered starting points, and additional RAM is read to prove the point. Now a maximum of 40.6 GB/second is shown, at 4 threads, 2.2 times faster than with one thread.
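The staggered starting points used in Version 2 can be pictured with the sketch below. It assumes evenly spaced starting offsets and wrap-around at the end of the array. Both are illustrative choices, since the report only says that starting points are staggered. The last lines repeat the 4 thread against 1 thread comparison using the 786420 KB 128bSSE2 figures from the Version 2 results.

```python
def staggered_starts(n_words, n_threads):
    """Evenly spaced starting offsets, one per thread (assumed spacing)."""
    return [(t * n_words) // n_threads for t in range(n_threads)]


def read_order(n_words, start):
    """Indices visited by one thread: every word once, wrapping at the end."""
    return [(start + i) % n_words for i in range(n_words)]


starts = staggered_starts(16, 4)
orders = [read_order(16, s) for s in starts]
print(starts)
print(orders[1])

# MBytes/second at 786420 KB, 128bSSE2 column, Version 2 results.
one_thread = 18190
four_threads = 40628
speedup = four_threads / one_thread
print(round(speedup, 1))
```

Each thread still reads every word, but no two threads begin at the same address, so one thread is less likely to find data that another has just brought into the shared cache.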

MP Bus Speeds 64 bit Version 1.0, 1 Threads, Sun Oct 22 14:03:08 2017
                                                                             32 bit
  Kbytes Inc32wds Inc16wds  Inc8wds  Inc4wds  Inc2wds  ReadAll 128bSSE2  ReadAll 128bSSE2

       6    31168    31240    31189    42408    43371    43670    61517    20443    61544
      24    31267    31251    31217    42139    43348    43816    62254    20787    62259
      96    13627    14374    15240    24228    32977    40497    60299    20459    60286
     384     5556     5707     5797    11305    20134    34224    39990    20366    41534
     768     5348     5442     5555    10923    19356    33585    38201    20385    38255
    1536     5311     5421     5555    10924    19385    33698    38255    20421    38362
   16380     1240     1564     2130     4671     9149    18280    18843    16515    19109
  131070     1201     1469     2098     4573     8500    18137    18128    16472    17808
  393210     1155     1453     2098     4557     8112    18145    17813    15913    18024

MP Bus Speeds 64 bit Version 2.0, 1 Threads, Sun Oct 22 14:06:58 2017
                                                                            Version 1
  Kbytes Inc32wds Inc16wds  Inc8wds  Inc4wds  Inc2wds  ReadAll 128bSSE2  ReadAll 128bSSE2

       6    31594    31270    31258    41133    36625    41267    61563    43670    61517
      24    31283    31252    31211    42440    38461    42184    62258    43816    62254
      96    14896    15334    15560    24390    32204    39245    60395    40497    60299
     384     5703     5835     5988    11721    20542    34338    40726    34224    39990
     768     5389     5468     5563    10924    19334    33585    38159    33585    38201
    1536     5365     5453     5564    10925    19339    33598    38187    33698    38255
   16380     1285     1562     2179     4795     8882    18631    19165    18280    18843
  131070     1225     1453     2096     4460     8528    18195    18187    18137    18128
  393210     1225     1454     2077     4477     8703    18188    18059    18145    17813
  786420     1230     1450     2051     4576     8571    17598    18190
 1572840     1216     1486     2086     4583     8647    18109    18427

2 Threads

       6    29512    30019    59608    60056    69436    80271   102281    84136   123044
      24    59225    59487    58806    83693    75177    83373   124495    86728   121640
      96    20250    21156    21937    38565    59794    76975   120371    80333   121121
     384    10653    10963    11272    21556    38987    59334    80732    65431    82328
     768    10087    10384    10637    19731    36985    63797    75626    63587    76116
    1536    10103    10435    10729    20807    37071    63898    76338    63838    76340
   16380     2628     3222     4158     8358    15989    32486    33558    32248    33552
  131070     1968     2585     3803     8004    15471    31863    32579    32166    33354
  393210     1969     2594     3825     7570    15511    31911    32714    32125    33558
  786420     1966     2592     3722     7989    15429    32025    32676
 1572840     1970     2593     3839     8112    15467    32103    32767

4 Threads

       6    25920    29754    58965    64123    95935   147826   260224   167038   205273
      24   114028   118093   119688   117904   114405   163844   244665   173073   243044
      96    42412    42912    43013    75571   119669   154540   240629   160370   241160
     384    20903    21781    22653    42992    77420   128661   163280   127537   159648
     768    19201    19029    20653    39706    72719   117327   151481   125191   151515
    1536    18637    19725    20659    39744    73196   101971   151967   125482   151584
   16380     6026     6764     8179    14740    28176    54888    58802    57175    61785
  131070     2034     3088     5019    10004    19712    38982    40418    52960    61816
  393210     2033     3099     4303    10048    19856    39126    40572    57642    53405
  786420     2068     3092     5050    10077    19819    39096    40628
 1572840     2032     2858     4348     9412    19851    39157    39699

8 Threads

       6    10245    11452    24238    46432    91436    85135   216659   151955   278208
      24    42877    46747    90912    92228   124711   142776   283743   150852   298146
      96    36838    44259    43458    80107   122566   136226   193969   138749   276197
     384    23488    22078    28973    53186    85603   138786   176651   122014   206507
     768    21820    25828    27393    38557    79105   149178   190188    95956   162380
    1536    20182    21804    25304    40594    72493   120503   155289   112027   177947
   16380     6786     7686     9822    19679    35524    59894    73745    64625    65317
  131070     3015     3832     4361     9619    19162    39564    38654    47164    46280
  393210     2390     3176     4901     9995    19884    39652    42583    50841    51818
  786420     2300     3045     4821    10165    19444    38217    38839
 1572840     2032     2992     4792     9680    19259    38238    38778

RandMem MP Benchmark

The program uses the same code for serial and random access, via a complex indexing structure and comprises Read (RD) and Read/Write (RW) tests. They are run to use data from all caches and RAM. This benchmark uses data from the same array for all threads, but starting at different points. All indexing and arithmetic is carried out using 32 bit integers, leading to 64 bit and 32 bit compilations producing the same performance, subject to variations caused by short running times. Only 64 bit results are provided below.
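The idea of one loop serving both serial and random access can be sketched as below, where an index array is either left in order or shuffled. This is only an illustration: the benchmark's own indexing structure is described as complex and is not reproduced here, and the names, the fixed seed and the use of addition are assumptions.

```python
import random


def make_index(n_words, randomised, seed=1):
    """Index array in serial order, or shuffled for random access."""
    index = list(range(n_words))
    if randomised:
        random.Random(seed).shuffle(index)
    return index


def read_pass(data, index):
    """Read every word via the index array, keeping a 32 bit running total."""
    total = 0
    for i in index:
        total = (total + data[i]) & 0xFFFFFFFF
    return total


data = list(range(1000))
serial_index = make_index(1000, randomised=False)
random_index = make_index(1000, randomised=True)

serial_sum = read_pass(data, serial_index)
random_sum = read_pass(data, random_index)
print(serial_sum, random_sum)
```

Both passes execute identical instructions and touch the same words, so any difference in measured speed comes from the order in which memory is visited.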

First Windows Versions (obsolete) - RandMP32.exe and RandMP64.exe are also in dualcore.zip, with source code in newsource.zip and further details in randmem results.htm.

Later Windows Versions (1 to 8 threads) - Rand8Thread32.exe and Rand8Thread64.exe are available in quadcore.zip, with further details included in quad core 8 thread.htm.

Linux Versions - MPrandmem32 and MPrandmem64 can be found in linux_multithreading_apps.tar.gz. They have the same run time format as the above Linux benchmarks for up to 64 threads. Further details can be found in linux multithreading benchmarks.htm.

The Linux benchmark has additional Mutex tests that restrict updating access to one thread at a time. The effect appears to produce some faster speeds with cached data but slower from RAM. With the other procedures, multithreading performance gains and losses are different between the Windows and Linux compilations.
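A mutex protected read/write pass might look like the sketch below, in which every thread updates all words of one shared array, starting at a different point, and takes a lock around each update. The lock granularity, the offsets and the names are assumptions. The report states only that updating is restricted to one thread at a time.

```python
import threading


def mutex_read_write(n_words, n_threads):
    """Each thread increments every word once, one update at a time."""
    data = [0] * n_words
    lock = threading.Lock()

    def worker(start):
        for k in range(n_words):
            i = (start + k) % n_words
            with lock:  # only one thread may update at any moment
                data[i] += 1

    threads = [
        threading.Thread(target=worker, args=((t * n_words) // n_threads,))
        for t in range(n_threads)
    ]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return data


result = mutex_read_write(256, 4)
print(result[:4])
```

Because the lock serialises the updates, the total work proceeds at roughly single thread speed, which is in line with the Mutex SRW and Mutex RRW rows staying close to the 1 thread figures as the thread count rises.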

Windows 1 to 8 Threads - 64 Bit Version

RandMP 8 Thread Write/Read Test 64 bit Ver. 2.0 Sat May 10 14:38:49 2014

          ------------------ MBytes Per Second At --------------------
              6 KB   24 KB   96 KB  384 KB  768 KB 1536 KB   12 MB   96 MB

1 Thread
Serial RD    30475   30602   18350   17605   17557   17595   12405   11243
Serial RW    30195   30175   22013   17576   17469   17482   11531   10642
Random RD    28916   29109   13124    8290    6319    5669    1308     655
Random RW    30726   29935    9498    6161    5232    4813    1185     608

2 Threads Total
Serial RD    61153   60840   36843   35338   35231   35297   23339   21157
Serial RW    21994   21510   20967   21670   33037   33428   23256   21508
Random RD    57862   57902   26154   16611   12484   11248    2622    1302
Random RW     3761    4658    5132    6599    6963    7114    2399    1282

4 Threads Total
Serial RD   116765  120919   73499   62205   70973   70902   45280   41568
Serial RW    20776   31110   38023   42715   43836   65636   47000   42876
Random RD   110503  115241   52011   32996   24884   22294    5247    2540
Random RW     3324    6532    8197   11159   11724   12507    4747    2494

8 Threads Total
Serial RD   111370  114213   95358   92240   89120   87557   74104   63754
Serial RW    28212   37141   54805   64501   56425   72723   70007   49286
Random RD   108353  110797   59991   41932   32190   14669    4878    2897
Random RW     5150    8024    9153   17569   15918   13841    4661    2528

Linux 1 to 8 Threads - 64 Bit Version

RandMemMP Speeds 64 Bit Version 1, X Threads, Sun Oct 22 15:00:43 2017

          ------------------ MBytes Per Second At --------------------
              6 KB   24 KB   96 KB  384 KB  768 KB 1536 KB   12 MB   96 MB

1 Thread
Serial RD    27991   27801   20258   19249   19249   19294   12477   11683
Serial RW    29969   30241   21896   17829   17494   17499   12085   11565
Random RD    27484   27463   13589    8257    6220    5604    2471    1011
Random RW    30364   30075    9168    6108    5177    4783    2804     982
Mutex SRW    29982   30245   21897   17762   17433   17432   12130   11529
Mutex RRW    30361   30071    9176    6108    5175    4782    2772     982

2 Threads
Serial RD    40622   55523   40299   38028   37866   37878   23094   22142
Serial RW    14539   21855   20979   22448   31456   25642   24743   18109
Random RD    40316   54307   26840   16365   12340   11092    4747    1913
Random RW     3039    4599    5107    6570    6943    7115    4904    1773
Mutex SRW    15294   29770   21777   17761   17385   17130   12099   11298
Mutex RRW    22396   29829    9251    6098    5174    4779    2817     970

4 Threads
Serial RD    39300  106376   80250   75904   75310   75408   43206   37738
Serial RW    15182   31547   35603   38859   45426   60180   48848   20287
Random RD    72790  104282   52951   31312   12640   21975    6813    3317
Random RW     2582    5910    8171   11159    9140   12510    9591    3261
Mutex SRW    20566   29383   21517   18150   16703   16945   11798   11177
Mutex RRW    22006   29629    8880    5881    5035    4666    2702     967

8 Threads
Serial RD    37987   76974   96575   94809   88112   88170   66556   60949
Serial RW     9030   29524   52796   47811   52557   69516   68200   25318
Random RD    37120   76419   65662   32215   24619   22463   13226    3346
Random RW     2013    6036    9032   17133   16426   15039   11082    2829
Mutex SRW     8207   17043   20147   17135   16675   16621   11714   10827
Mutex RRW     9865   20828    8613    5574    4889    4567    2676     951

MP MFLOPS Benchmarks

The benchmarks carry out calculations of the form x[i] = (x[i] + a) * b - (x[i] + c) * d + (x[i] + e) * f with 2, 8 or 32 operations per input data word, using data sizes of 0.1, 1.02 and 10.24 million words. Each thread deals with separate segments of the data, via shared code. Both 32 bit and 64 bit versions have been produced, with results in single precision MFLOPS.
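The formula as written contains 8 floating point operations (three adds inside the brackets, three multiplies, one subtract and one add). The sketch below applies it with one thread per segment of the data, using shared code. The constants, function names and the MFLOPS helper are hypothetical. Python threads will not show the speed gains reported here, so the example only illustrates how the data is partitioned and how a rate could be derived.

```python
import threading

# Hypothetical constants: the values the benchmark uses are not given in this report.
A, B, C, D, E, F = 0.1, 0.9, 0.2, 0.8, 0.3, 0.7


def calc8(x, lo, hi):
    """Apply the 8 operation form of the calculation to x[lo:hi], in place."""
    for i in range(lo, hi):
        v = x[i]
        x[i] = (v + A) * B - (v + C) * D + (v + E) * F


def segment_bounds(n_words, n_threads):
    """Split n_words into n_threads contiguous, non-overlapping segments."""
    return [(t * n_words // n_threads, (t + 1) * n_words // n_threads)
            for t in range(n_threads)]


def run_threads(x, n_threads):
    """Run calc8 with one thread per segment of x."""
    threads = [threading.Thread(target=calc8, args=(x, lo, hi))
               for lo, hi in segment_bounds(len(x), n_threads)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()


def mflops(n_words, ops_per_word, passes, seconds):
    """Millions of floating point operations per second for a timed run."""
    return n_words * ops_per_word * passes / seconds / 1e6


serial = [1.0] * 1000
calc8(serial, 0, len(serial))

threaded = [1.0] * 1000
run_threads(threaded, 4)

print(threaded == serial)
print(segment_bounds(1000, 4))
```

Since no word is shared between segments, the threaded result matches the serial one without any locking.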

Windows Versions - MPmflops32.exe using 32 bit instructions, MPmflops64.exe with SSE instructions, MPmflopsc2.exe, a later 64 bit SSE compilation for full SIMD operation, and MPmflopsAVX.exe, a 64 bit compilation using the /arch:AVX option. The benchmarks and source code are available in gigaflops-benchmarks.zip, with further details and results in GigaFLOPS Benchmarks.htm. All were compiled from the same code to handle up to 64 threads (Command Format Example - MPmflopsc2 Threads 8).

Linux Versions - MPmflops32, MPmflops32SSE and MPmflops64, where benchmarks and source code are also in linux_multithreading_apps.tar.gz, again for up to 64 threads. Further details and results can be found in linux multithreading benchmarks.htm. Later, MPmflops64AVX was produced and is in AVX_benchmarks.tar.gz, with details in Linux AVX_benchmarks.htm.

Results for runs on Windows and Linux are below. The first set is from a compilation for old i87 32 bit floating point. The second had a compiler directive to use SSE functions, but only achieved Single Instruction Single Data (SISD) operation, using one word out of the 4 word registers, and was slightly faster during the early tests. The third is the later SSE compilation, with full SIMD operation. The fourth, with an AVX compiler directive, generated the appropriate vector instructions, but applied to 128 bit SSE registers, producing the same performance as the SIMD SSE tests.

Maximum SSE MFLOPS per core are equal to CPU MHz x 4 (128 bit SSE register width) x 2 (linked multiply and add), or 31.2 GFLOPS for the Core i7 considered here, giving 124.8 GFLOPS for four cores. The 256 bit AVX registers double this score. Both Windows and Linux programs demonstrated respectable performance of more than 90 GFLOPS for SSE, with the Linux benchmark reaching nearly 180 GFLOPS using AVX instructions.
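The maximum speed arithmetic above can be written out as a short calculation. The function name and parameters are illustrative. The clock speed, register widths and core count are those quoted in this report, and the measured figure is the best Linux AVX result in the table below.

```python
def peak_gflops(mhz, register_bits, cores, word_bits=32, ops_per_lane=2):
    """Peak GFLOPS = MHz x words per register x ops per word x cores / 1000."""
    lanes = register_bits // word_bits  # 4 for 128 bit SSE, 8 for 256 bit AVX
    return mhz * lanes * ops_per_lane * cores / 1000.0


sse_one_core = peak_gflops(3900, 128, 1)
sse_four_cores = peak_gflops(3900, 128, 4)
avx_four_cores = peak_gflops(3900, 256, 4)

# Best Linux AVX result (8 threads, 32 operations per word, 1.02 million words).
measured_avx = 177831 / 1000.0
fraction_of_peak = measured_avx / avx_four_cores

print(sse_one_core, sse_four_cores, avx_four_cores)
print(round(fraction_of_peak * 100, 1))
```

On these figures, the best AVX result is a little over 70% of the 249.6 GFLOPS maximum.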

Windows MFLOPS 1 to 16 Threads
Core i7 4820K, 3900 MHz, 4 core 8 Thrd, 256 KB x 4 L2

Operations Per Word        2       2       2       8       8       8      32      32      32
Million Words           0.10    1.02   10.24    0.10    1.02   10.24    0.10    1.02   10.24

Windows i87
Threads  1              3867    3853    3386    6085    6054    6017    5830    5824    5809
Threads  2              7737    7731    6618   12160   12165   11991   11653   11648   11650
Threads  4             15433   15459    9833   23487   24291   23886   22666   23175   23220
Threads  8             15359   15395    9846   23554   23708   23586   23418   23464   23416
Threads 16             15145   15192   10023   23422   23536   22966   23241   23401   23282

Windows SISD
Threads  1              5004    4960    4192    6188    6182    6135    5890    5890    5887
Threads  2              9996   10002    8049   12371   12354   12282   11770   11779   11744
Threads  4             19923   18532    9866   23946   24704   24347   23219   23531   23497
Threads  8             19602   19776    9820   24683   24648   24634   23521   23497   23506
Threads 16             18727   19077   10073   24316   24243   24442   23469   23393   23385

Windows SIMD
Threads  1             10116    9864    5852   24636   24436   19881   23353   23389   23243
Threads  2             26453   19851    9189   49181   49223   34969   46653   46759   46414
Threads  4             41845   26975   10063   85909   93852   40163   89202   90572   87329
Threads  8             58734   43723    9980   97139   98446   40062   91320   93885   93125
Threads 16             57731   42194   10178   94166   93338   40074   90162   92102   93496

Windows AVX 1
Threads  1             10046    9901    5906   24629   24382   19832   23411   23361   23246
Threads  2             26634   19679    9250   49194   49267   35183   46788   46788   46382
Threads  4             52424   39057   10092   60266   98220   39744   90948   90611   92515
Threads  8             58601   43529   10032   85198   98220   40162   93810   93866   93745
Threads 16             57098   42920   10319   86267   95243   40427   92929   92995   92356

Linux MFLOPS 1 to 8 Threads
Core i7 4820K, 64 bit

Operations Per Word        2       2       2       8       8       8      32      32      32
Million Words           0.10    1.02   10.24    0.10    1.02   10.24    0.10    1.02   10.24

Linux SIMD SSE
Threads  1              9681    9759    5990   24533   24570   19975   23269   23307   23052
Threads  4             45340   21688    9237   49320   49918   36638   46942   89676   91029
Threads  8             54621   41832   10026   92086   92352   39982   92408   93282   92050

Linux SIMD AVX
Threads  1             12542   11404    5991   35982   36180   23299   46400   46572   44729
Threads  4             62273   23031    8970  159040   80096   40124   90572   91058   88877
Threads  8             60258   44329    9977  173224  151909   40153  173372  177831  158594

OpenMP and QPAR Benchmarks

The benchmarks carry out the same calculations as MP MFLOPS Benchmarks above, but without multithreading code, where main loops are preceded with a "#pragma omp parallel for" directive and, in some cases, a compile parameter.

Windows OpenMP Benchmarks - OpenMP32MFLOPS.exe, SSE32MFLOPS.exe (same code no OpenMP directives) and OpenMP64MFLOPS.exe are included in openmpmflops.zip. Further details and results are included in openmp mflops.htm. Different OpenMP benchmarks are covered in openmp speeds.htm.

With Visual Studio 2012, Microsoft added QPAR, the Auto-Parallelizer, to the compiler. This can automatically generate multiple threads in the same way as OpenMP. The benchmark QparMP64MFLOPS.exe was produced, with execution and source files included in gigaflops-benchmarks.zip, and details and results in GigaFLOPS Benchmarks.htm and quad core 8 thread.htm.

Linux Original Versions - openMPmflops32, openMPmflops64, notOMPmflops32 and notOMPmflops64, from linux openmp.tar.gz with details in linux openmp benchmarks.htm. Then there are Later Versions - openMPmflops64, notOMPmflops64 and openMPmflops64AVX in AVX_benchmarks.tar.gz, with details in Linux AVX_benchmarks.htm.

Results below are again from benchmarking the 3.9 GHz Core i7.

Windows OpenMP64MFLOPS.exe provides similar speeds to 64 bit MP-MFLOPS SISD at 32 operations per word, otherwise it is slower.

QparMP64MFLOPS.exe obtains 4 thread performance similar to that of MP-MFLOPS SIMD. QPAR appears to provide a better alternative than OpenMP but, overall, hand coded multithreading seems to be the best option.

Linux notOMPmflops64 V1 and V2 achieve speeds similar to those of the single thread MP-MFLOPS benchmark, but fall well short of the 4 thread test and, particularly, the one using 8 threads.

openMPmflops64AVX performance is generally inferior to that from Linux MPmflops64AVX.

Operations Per Word       2      2      2      8      8      8     32     32     32
Million Words          0.10   1.02  10.24   0.10   1.02  10.24   0.10   1.02  10.24

Windows
SSE32MFLOPS.exe        4898   4845   4171   5824   5994   6094   5796   5829   5795
OpenMP32MFLOPS.exe     6511   9290   9119  14351  17324  17592  21454  22884  22850
OpenMP64MFLOPS.exe     8420  12440   9483  18477  23210  23737  22134  18281  19690

QparMP64MFLOPS.exe
1 Thread               9691   9454   5743  23214  23126  19033  22700  23541  23405
2 Threads             23972  18673   9177  44855  44919  33868  44070  45733  46419
4 Threads             43356  36007  10084  76380  91259  40349  85300  81803  69212
8 Threads             44741  33966   9732  81506  73857  36635  87736  91170  87086

Linux
notOMPmflops64 V1     10093   9803   5919  24634  24651  20097  23519  23520  23339
openMPmflops64 V1      9084  12363   8089  22273  23039  22432  22683  23195  23096
notOMPmflops32 V2      3884   3886   3612   6145   6151   6067   5837   5835   5830
openMPmflops32 V2      9483  12481   8628  22347  23032  22742  22691  23247  23126
notOMPmflops64 V2      9879   9772   5934  24500  24529  20039  23285  23290  23090
openMPmflops64 V2     11163  20322   9180  45392  49695  33927  21534  22477  22476
openMPmflops64AVX     19713  37822   9219  94036  68725  36923  22761  23133  23019

Contents

General

Whetstone MP Benchmark

CPU Speed Via Assembly Language Add Instructions

BusSpeed MP Benchmark

Linux Results

RandMem MP Benchmark

MP MFLOPS Benchmarks

OpenMP and QPAR Benchmarks